The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000, and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van, and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the two cars.
Object recognition
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.
● Exploratory Data Analysis
● Reduce the number of dimensions in the dataset with minimal information loss
● Train a model using Principal Components
Apply a dimensionality reduction technique (PCA) and train a model on the principal components instead of training it on the raw features alone.
● Book on PCA
● Application of PCA for image compression
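Before working on the vehicle data, the core idea can be shown in a minimal sketch on synthetic data (the two features below are illustrative, not from the dataset): when two features are strongly correlated, almost all of their variance collapses onto a single principal component, so one dimension can be dropped with little information loss.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two correlated features: the second is mostly a noisy copy of the first
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
pca.fit(X)
# Almost all variance lies along the first principal component
print(pca.explained_variance_ratio_)
```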
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
# Reading the dataset
data = pd.read_csv('vehicle-1.csv')
data.sample(5)
# Shape of the data is 846 rows and 19 columns
data.shape
# There are some missing values in some of the rows.
data.info()
# Let's check the data distribution to decide how to impute the missing values
data.describe()
# Let's treat the missing values by replacing them with the median of each column
from sklearn.impute import SimpleImputer
# Remove the target class variable before imputation and store it for future use
y = data['class']
X = data.drop(axis=1, columns=['class'])
cols = X.columns
impute = SimpleImputer(strategy='median')
X = pd.DataFrame(impute.fit_transform(X), columns=cols)
X.info()
corr = X.corr()
corr
# Let's check the heatmap to better visualize the correlation coefficients
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True)
sns.pairplot(X, diag_kind='kde')
To avoid multicollinearity issues, one would normally drop some of the highly correlated columns. However, in the later part of the project we want to do Principal Component Analysis, so it's better to retain all the original variables and let the PCA itself capture the maximum variance possible.
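For reference, highly correlated pairs can be pulled out of a correlation matrix programmatically rather than read off the heatmap. This sketch uses a small synthetic frame (the column names are borrowed for illustration only, and the 0.9 threshold is a modelling choice, not a rule):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    'scatter_ratio': a,
    'elongatedness': -a + rng.normal(scale=0.05, size=100),  # strongly anti-correlated
    'hollows_ratio': rng.normal(size=100),                   # independent
})

corr = df.corr()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = (
    upper.stack()                       # (feature_i, feature_j) -> r
         .loc[lambda s: s.abs() > 0.9]  # keep only strongly correlated pairs
)
print(high_pairs)
```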
# Decided not to drop the columns.
# X.drop(columns=['circularity', 'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity'], inplace=True)
# X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)
X_train.head()
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
print('Train set Accuracy: ', svc.score(X_train, y_train))
y_pred = svc.predict(X_test)
print('Test set Accuracy: ', svc.score(X_test, y_test))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is the replacement
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test, values_format='g')
from sklearn.model_selection import cross_validate, cross_val_score, cross_val_predict, GridSearchCV
param = {
'C' : [1.0, 2.0, 10.0],
'gamma' : ['scale', 'auto']
}
grid = GridSearchCV(svc, param, cv=5, refit=True)
grid.fit(X_train, y_train)
print('Train set Accuracy: ', grid.score(X_train, y_train))
print('Test set Accuracy: ', grid.score(X_test, y_test))
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(grid, X_test, y_test, values_format='0.0f')
We need to scale the data before applying Principal Component Analysis, because PCA is driven by variance: features measured on larger scales would otherwise dominate the principal components.
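The effect is easy to demonstrate on synthetic data (the two features below are illustrative, not from the dataset): without scaling, the large-variance feature swallows the first component; after z-scoring, both features contribute comparably.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Two independent features on very different scales
X = pd.DataFrame({
    'small_scale': rng.normal(scale=1.0, size=300),
    'large_scale': rng.normal(scale=100.0, size=300),
})

# Without scaling, the large-variance feature dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
# After z-scoring, the variance is shared roughly evenly
scaled_ratio = PCA(n_components=2).fit(X.apply(zscore)).explained_variance_ratio_
print(raw_ratio, scaled_ratio)
```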
from scipy.stats import zscore
X_scaled = X.apply(zscore)
X_scaled.head()
X_scaled.shape
from sklearn.decomposition import PCA
pca = PCA(n_components=14, random_state=10)
pca.fit(X_scaled)
# Checking the explained variance by each Principal component,
# it looks like the first 4 components explain the majority of the variance in the data
pca.explained_variance_
# Checking the Explained variance ratio
pca.explained_variance_ratio_
# Plotting a bar graph to visualize the comparative explained variance
plt.figure(figsize=(10,6))
plt.xlabel('Principal Components')
plt.ylabel('Explained variance Ratio')
plt.bar(list(range(1,15)),pca.explained_variance_ratio_);
# Plotting a step graph to view the cumulative explained variance
plt.figure(figsize=(10,6))
plt.xlabel('Principal Components')
plt.ylabel('Cumulative Explained variance')
plt.step(list(range(1,15)),np.cumsum(pca.explained_variance_ratio_),where='mid');
np.cumsum(pca.explained_variance_ratio_)
From the step graph and the cumulative explained variance numbers above, 7 principal components are required to explain more than 95% of the total variance in the dataset.
pca7 = PCA(n_components=7)
pca7.fit(X_scaled)
Xpca7 = pd.DataFrame(pca7.transform(X_scaled))
Xpca7.head()
sns.pairplot(Xpca7);
# Generating the new Train Test set using the same random state.
X_train, X_test, y_train, y_test = train_test_split(Xpca7, y, test_size=0.3, random_state=6)
X_train.head()
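One caveat with this workflow: the z-scoring and the PCA were fitted on the full dataset before the train/test split, so information from the test rows leaks into the transformation. A common alternative is to wrap scaling, PCA, and the SVM in a `Pipeline`, so the transformers are refit on the training fold only inside each cross-validation split. The sketch below uses `make_classification` as a synthetic stand-in for the vehicle features, not the real data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 18 numeric features, 3 classes
X, y = make_classification(n_samples=500, n_features=18, n_informative=8,
                           n_classes=3, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=6)

# Scaling and PCA are fit on the training data of each CV fold, never on test rows
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('svc', SVC()),
])
param = {'svc__C': [1.0, 2.0, 10.0], 'svc__gamma': ['scale', 'auto']}
grid = GridSearchCV(pipe, param, cv=5, refit=True)
grid.fit(X_train, y_train)
print('Test set accuracy:', grid.score(X_test, y_test))
```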
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
print('Train set Accuracy: ', svc.score(X_train, y_train))
y_pred = svc.predict(X_test)
print('Test set Accuracy: ', svc.score(X_test, y_test))
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test, values_format='0.0f')
from sklearn.model_selection import cross_validate, cross_val_score, cross_val_predict, GridSearchCV
param = {
'C' : [1.0, 2.0, 10.0],
'gamma' : ['scale', 'auto']
}
grid = GridSearchCV(svc, param, cv=5, refit=True)
grid.fit(X_train, y_train)
print('Train set Accuracy: ', grid.score(X_train, y_train))
print('Test set Accuracy: ', grid.score(X_test, y_test))
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(grid, X_test, y_test, values_format='0.0f')